11 research outputs found

    The Tag Filter Architecture: An energy-efficient cache and directory design

    Full text link
    [EN] Power consumption in current high-performance chip multiprocessors (CMPs) has become a major design concern that aggravates with the current trend of increasing the core count. A significant fraction of the total power budget is consumed by on-chip caches which are usually deployed with a high associativity degree (even L1 caches are being implemented with eight ways) to enhance the system performance. On a cache access, each way in the corresponding set is accessed in parallel, which is costly in terms of energy. On the other hand, coherence protocols also must implement efficient directory caches that scale in terms of power consumption. Most of the state-of-the-art techniques that reduce the energy consumption of directories are at the cost of performance, which may become unacceptable for high-performance CMPs. In this paper, we propose an energy-efficient architectural design that can be effectively applied to any kind of cache memory. The proposed approach, called the Tag Filter (TF) Architecture, filters the ways accessed in the target cache set, and just a few ways are searched in the tag and data arrays. This allows the approach to reduce the dynamic energy consumption of caches without hurting their access time. For this purpose, the proposed architecture holds the XX least significant bits of each tag in a small auxiliary X-bit-wide array. These bits are used to filter the ways where the least significant bits of the tag do not match with the bits in the X-bit array. Experimental results show that, on average, the TF Architecture reduces the dynamic power consumption across the studied applications up to 74.9%74.9%, 85.9%85.9%, and 84.5%84.5% when applied to L1 caches, L2 caches, and directory caches, respectively.This work has been jointly supported by MINECO and European Commission (FEDER funds) under the project TIN2015-66972-C5-1-R/3-R and by Fundación Séneca, Agencia de Ciencia y Tecnología de la Región de Murcia under the project Jóvenes Líderes en Investigación 18956/JLI/13.Valls, J.; Ros Bardisa, A.; Gómez Requena, ME.; Sahuquillo Borrás, J. (2017). The Tag Filter Architecture: An energy-efficient cache and directory design. Journal of Parallel and Distributed Computing. 100:193-202. https://doi.org/10.1016/j.jpdc.2016.04.016S19320210

    TokenTLB+CUP: A Token-Based Page Classification with Cooperative Usage Prediction

    Full text link
    [EN] Discerning the private or shared condition of the data accessed by the applications is an increasingly decisive approach to achieving efficiency and scalability in multi- and many-core systems. Since most memory accesses in both sequential and parallel applications are either private (accessed only by one core) or read-only (not written) data, devoting the full cost of coherence to every memory access results in sub-optimal performance and limits the scalability and efficiency of the multiprocessor. This paper introduces TokenTLB, a TLB-based page classification approach based on exchange and count of tokens. Token counting on TLBs is a natural and efficient way for classifying memory pages, and it does not require the use of complex and undesirable persistent requests or arbitration. In addition, classification is extended with Cooperative Usage Predictor (CUP), a token-based system-wide page usage predictor retrieved through TLB cooperation, in order to perform a classification unaffected by TLB size. Through cycle-accurate simulation we observed that TokenTLB spends 43.6% of cycles as private per page on average, and CUP further increases the time spent as private by 22.0%. CUP avoids 4 out of 5 TLB invalidations when compared to state-of-the-art predictors, thus proving far better prediction accuracy and making usage prediction an attractive mechanism for the first time.This work has been jointly supported by the MINECO and European Commission (FEDER funds) under the project TIN2015-66972-C5-1-R and TIN2015-66972-C5-3-R and the Fundacion Seneca-Agencia de Ciencia y Tecnologia de la Region de Murcia under the project Jovenes Lideres en Investigacion 18956/JLI/13.Esteve Garcia, A.; Ros Bardisa, A.; Robles Martínez, A.; Gómez Requena, ME. (2018). TokenTLB+CUP: A Token-Based Page Classification with Cooperative Usage Prediction. IEEE Transactions on Parallel and Distributed Systems. 29(5):1188-1201. https://doi.org/10.1109/TPDS.2017.2782808S1188120129

    PS-Dir: A Scalable Two-Level Directory Cache

    Full text link
    As the number of cores increases in both incoming and future chip multiprocessors, coherence protocols must address novel hardware structures in order to scale in terms of performance, power, and area. It is well known that most blocks accessed by parallel applications are private (i.e., accessed by a single core). These blocks present different directory requirements and behavior than shared blocks. Based on this fact, this paper proposes a two-level directory cache that tracks shared blocks in a small and fast first-level cache and private blocks in a larger and slower second-level cache, namely Shared and Private caches, respectively. Speed and area reasons suggest the use of eDRAM technology much dense but slower than SRAM technology for the Private cache, which in turn brings energy savings. Experimental results for a 16-core system show improvements in performance by 11.1%, in area by 25.4%, and in energy consumption by 20.5% compared to a conventional directory cache.This work has been supported by the Spanish MICINN, as well as European Commission FEDER funds, under Grants CSD2006-00046 and TIN2009-14475-C04Valls, JJ.; Ros Bardisa, A.; Sahuquillo Borrás, J.; Gómez Requena, ME.; Duato Marín, JF. (2012). PS-Dir: A Scalable Two-Level Directory Cache. IEEE Computer Society. https://doi.org/10.1145/2370816.2370891

    TLB-Based Temporality-Aware Classification in CMPs with Multilevel TLBs

    Full text link
    "© 2017 IEEE. Personal use of this material is permitted. Permissíon from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertisíng or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works."[EN] Recent proposals are based on classifying memory accesses into private or shared in order to process private accesses more efficiently and reduce coherence overhead. The classification mechanisms previously proposed are either not able to adapt to the dynamic sharing behavior of the applications or require frequent broadcast messages. Additionally, most of these classification approaches assume single-level translation lookaside buffers (TLBs). However, deeper and more efficient TLB hierarchies, such as the ones implemented in current commodity processors, have not been appropriately explored. This paper analyzes accurate classification mechanisms in multilevel TLB hierarchies. In particular, we propose an efficient data classification strategy for systems with distributed shared last-level TLBs. Our approach classifies data accounting for temporal private accesses and constrains TLB-related traffic by issuing unicast messages on first-level TLB misses. When our classification is employed to deactivate coherence for private data in directory-based protocols, it improves the directory efficiency and, consequently, reduces coherence traffic to merely 53.0%, on average. Additionally, it avoids some of the overheads of previous classification approaches for purely private TLBs, improving average execution time by nearly 9% for large-scale systems.This work has been jointly supported by the MINECO and European Commission (FEDER funds) under the project TIN2015-66972-C5-1-R and TIN2015-66972-C5-3-R and the Fundacion Seneca-Agencia de Ciencia y Tecnologia de la Region de Murcia under the project Jovenes Lideres en Investigacion 18956/JLI/13.Esteve Garcia, A.; Ros Bardisa, A.; Gómez Requena, ME.; Robles Martínez, A.; Duato Marín, JF. (2017). TLB-Based Temporality-Aware Classification in CMPs with Multilevel TLBs. IEEE Transactions on Parallel and Distributed Systems. 28(8):2401-2413. https://doi.org/10.1109/TPDS.2017.2658576S2401241328

    Efficient TLB-Based Detection of Private Pages in Chip Multiprocessors

    Full text link
    © 2016 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.Most of the data referenced by sequential and parallel applications running in current chip multiprocessors are referenced by a single thread, i.e., private. Recent proposals leverage this observation to improve many aspects of chip multiprocessors, such as reducing coherence overhead or the access latency to distributed caches. The effectiveness of those proposals depends to a large extent on the amount of detected private data. However, the mechanisms proposed so far do not consider neither thread migration nor the private use of data within different application phases. As a result, a considerable amount of private data is not detected. In order to increase the detection of private data, we propose a TLB-based mechanism that is able to account for both thread migration and application phases. Simulation results show that the average number of pages detected as private significantly increases from 43 percent in previous proposals up to 79 percent in ours while keeping a reasonable TLB miss rate. Furthermore, when our proposal is used to deactivate the coherence for private data in a directory protocol, it improves execution time by 13.5 percent, on average, with respect to previous techniques.This work was jointly supported by the MINECO and European Commission (FEDER funds) under the project TIN2012-38341-C04-01/03 and the Fundacion Seneca-Agencia de Ciencia y Tecnologia de la Region de Murcia under the project Jovenes Lideres en Investigacion 18956/JLI/13. Albert Esteve is the corresponding author.Esteve García, A.; Ros Bardisa, A.; Gómez Requena, ME.; Robles Martínez, A.; Duato Marín, JF. (2016). Efficient TLB-Based Detection of Private Pages in Chip Multiprocessors. IEEE Transactions on Parallel and Distributed Systems. 27(3):748-761. https://doi.org/10.1109/TPDS.2015.2412139S74876127

    Increasing the effectiveness of directory caches by avoiding the tracking of noncoherent memory blocks

    Full text link
    © 2013 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.A key aspect in the design of efficient multiprocessor systems is the cache coherence protocol. Although directory-based protocols constitute the most scalable approach, the limited size of the directory caches together with the growing size of systems may cause frequent evictions and, consequently, the invalidation of cached blocks, which jeopardizes system performance. Directory caches keep track of every memory block stored in processor caches in order to provide coherent access to the shared memory. However, a significant fraction of the cached memory blocks do not require coherence maintenance (even in parallel applications) because they are either accessed by just one processor or they are never modified. In this paper, we propose to deactivate the coherence protocol for those blocks that do not require coherence. This deactivation means directory caches do not have to keep track of noncoherent blocks, which reduces directory cache occupancy and increases its effectiveness. Since the detection of noncoherent blocks is carried out by the operating system, our proposal only requires minor hardware modifications. Simulation results show that, thanks to our proposal, directory caches can avoid the tracking of about 66 percent (on average) of the blocks accessed by a wide range of applications, thereby improving the efficiency of directory caches. This contributes either to shortening the runtime of parallel applications by 15 percent (on average) while keeping directory cache size or to maintaining performance while using directory caches 16 times smaller.This work was supported by the Spanish MICINN, Consolider Programme and Plan E funds, as well as European Commission FEDER funds, under Grants CSD2006-00046 and TIN2009-14475-C04-01. It was also partly supported by (PROMETEO from Generalitat Valenciana (GVA) under Grant ROMETEO/2008/060). B. Cuesta was with Universitat Politecnica de Valencia while working on this paper.Cuesta Sáez, BA.; Ros Bardisa, A.; Gómez Requena, ME.; Robles Martínez, A.; Duato Marín, JF. (2013). Increasing the effectiveness of directory caches by avoiding the tracking of noncoherent memory blocks. IEEE Transactions on Computers. 62(3):482-495. https://doi.org/10.1109/TC.2011.241S48249562

    PS-Cache: an energy-efficient cache design for chip multiprocessors

    Full text link
    The final publication is available at Springer via http://dx.doi.org/10.1007/s11227-014-1288-5Power consumption has become a major design concern in current high-performance chip multiprocessors, and this problem exacerbates with the number of core counts. A significant fraction of the total power budget is often consumed by on-chip caches, thus important research has focused on reducing energy consumption in these structures. To enhance performance, on-chip caches are being deployed with a high associativity degree. Consequently, accessing concurrently all the ways in the cache set is costly in terms of energy. This paper presents the PS-Cache architecture, an energy-efficient cache design that reduces the number of accessed ways without hurting the performance. The PS-Cache takes advantage of the private-shared knowledge of the referenced block to reduce energy by accessing only those ways holding the kind of block looked up. Experimental results show that, on average, the PS-Cache architecture can reduce the dynamic energy consumption of L1 and L2 caches by 22 and 40%, respectively.This work has been jointly supported by the MINECO and European Commission (FEDER funds) under the project TIN2012-38341-C04-01 and the Fundaci’on Seneca-Agencia de Ciencia y Tecnología de la Región de Murcia under the project Jóvenes Líderes en Investigación 18956/JLI/13.Valls, JJ.; Ros Bardisa, A.; Sahuquillo Borrás, J.; Gómez Requena, ME. (2015). PS-Cache: an energy-efficient cache design for chip multiprocessors. Journal of Supercomputing. 71(1):67-86. https://doi.org/10.1007/s11227-014-1288-5S6786711Balasubramonian R, Jouppi NP, Muralimanohar N (2011) Multi-core cache hierarchies. In: Synthesis lectures on computer architecture. Morgan & Claypool Publishers, San RafaelHennessy JL, Patterson DA (2011) Computer architecture, fifth edition: a quantitative approach, 5th edn. Morgan Kaufmann Publishers Inc., San FranciscoSinharoy B, Kalla R, Starke WJ, Le HQ, Cargnoni R, Van Norstrand JA, Ronchetti BJ, Stuecheli J, Leenstra J, Guthrie GL, Nguyen DQ, Blaner B, Marino CF, Retter E, Williams P (2011) IBM POWER7 multicore server processor. IBM J Res Dev 5(3):1:1-1:29 doi: 10.1147/JRD.2011.2127330Kaxiras S, Hu Z, Martonosi M (2011) 28th International symposium on computer architecture (ISCA), pp 240–251Flautner K, Kim NS, Martin S, Blaauw D, Kaxiras TM, Hu Z, Martonosi M (2002) 29th International symposium on computer architecture (ISCA), pp 148–157Ghosh M, Özer E, Ford S, Biles S, Lee HHS (2009) International symposium on low power electronics and design (ISLPED), pp 165–170Calder B, Grunwald D (1996) 2nd international symposium on high-performance computer architecture (HPCA) (1996), pp 244–253Hardavellas N, Ferdman M, Falsafi B, Ailamaki A (2009) 36th international symposium on computer architecture (ISCA), pp 184–195Cuesta B, Ros A, Gómez ME, Robles A, Duato J (2011) 38th international symposium on computer architecture (ISCA), pp 93–103Pugsley SH, Spjut JB, Nellans DW, Balasubramonian R (2010) 19th international conference on parallel architectures and compilation techniques (PACT), pp 465–476Hossain H, Dwarkadas S, Huang MC (2011) 20th international conference on parallel architectures and compilation techniques (PACT), pp 45–55Kim D, Kim JAJ, Huh J (2010) 19th international conference on parallel architectures and compilation techniques (PACT), pp 111–122Ros A, Kaxiras S (2012) 21st international conference on parallel architectures and compilation techniques (PACT), pp 241–252Sundararajan KT, Porpodas V, Jones TM, Topham NP, Franke B (2012) 18th international symposium on high-performance computer architecture (HPCA), pp 311–322Agarwal N, Peh LS, Jha NK (2009) 15th international symposium on high-performance computer architecture (HPCA), pp 67–78Cantin JF, Smith JE, Lipasti MH, Moshovos A, Falsafi B (2006) Coarse-grain coherence tracking: región scout and region coherence arrays. IEEE Micro 26(1):70–95Ferdman M, Lotfi-Kamran P, Balet K, Falsafi B (2011) 17th international symposium on high-performance computer architecture (HPCA), pp 169–180Zebchuk J, Srinivasan V, Qureshi MK, Moshovos A (2009) 42nd IEEE/ACM international symposium on microarchitecture (MICRO), pp 423–434Powell M, Hyun Yang S, Falsafi B, Roy K, Vijaykumar TN (2000) International symposium on low power electronics and design (ISLPED), pp 90–95Albonesi DH (1999) 32nd IEEE/ACM international symposium on microarchitecture (MICRO), pp 248–259Zhang C, Vahid F, Yang J, Najjar W (2005) A way-halting cache for low-energy high-performance systems. ACM Transactions on Architecture and Code Optimization. 2(1):34–54Ghosh M, Özer E, Biles S, Lee HHS (2006) 19th international conference on architecture of computing systems (ARCS), pp 283–297Lee J, Hong S, Kim S (2011) 17th international symposium on low power electronics and design (ISLPED), pp 85–90Kedzierski K, Cazorla FJ, Gioiosa R, Buyuktosunoglu A, Valero M (2010) 2nd international forum on next-generation multicore/manycore technologies, pp 1–12Alouani I, Niar S, Kurdahi F, Abid M (2012) 23rd IEEE international symposium on rapid system prototyping (RSP), pp 44–48Meng J, Skadron K (2009) International conference on computer design (ICCD), pp 282–288Li Y, Abousamra A, Melhem R, Jones AK (2010) 19th international conference on parallel architectures and compilation techniques (PACT), pp 501–512Li Y, Melhem RG, Jones AK (2012) 21st international conference on parallel architectures and compilation techniques (PACT), pp 231–240Alisafaee M (2012) 45th IEEE/ACM international symposium on microarchitecture (MICRO), pp 341–350Jiang G, Fen D, Tong L, Xiang L, Wang C, Chen T (2009) 8th international symposium on advanced parallel processing technologies. Springer, Berlin, pp 123–133Sundararajan K, Jones T, Topham N (2013) IEEE 31st international conference on computer design (ICCD), pp 294–301Valls JJ, Ros A, Sahuquillo J, Gómez ME, Duato J (2012) 21st international conference on parallel architectures and compilation techniques (PACT), pp 451–452Ros A, Cuesta B, Gómez ME, Robles A, Duato J (2013) 42nd international conference on parallel processing (ICPP), pp 562–571Jacob B, Ng S, Wang D (2007) Memory systems: cache, DRAM, disk, 4th edn. Morgan Kaufmann Publishers Inc., San FranciscoPatterson DA, Hennessy JL (2008) Computer organization and design: the hardware/software interface. The Morgan Kaufmann Series in Computer Architecture and Design, 4th edn. Morgan Kaufmann Publishers Inc., San FranciscoMagnusson PS, Christensson M, Eskilson J, Forsgren D, Hallberg G, Hogberg J, Larsson F, Moestedt A, Werner B (2002) Simics: a full system simulation platform. IEEE Comput 35(2):50–58Martin MM, Sorin DJ, Beckmann BM, Marty MR, Xu M, Alameldeen AR, Moore KE, Hill MD, Wood DA (2005) Multifacet’s general execution-driven multiprocessor simulator GEMS toolset. Comput Archit News 33(4):92–99Agarwal N, Krishna T, Peh LS, Jha NK (2009) IEEE international symposium on performance analysis of systems and software (ISPASS), pp 33–42Muralimanohar N, Balasubramonian R, Jouppi NP (2009) Cacti 6.0. Tech. Rep. HPL-2009-85, HP LabsWoo SC, Ohara M, Torrie E, Singh JP, Gupta A (1995) 22nd international symposium on computer architecture (ISCA), pp 24–36Li ML, Sasanka R, Adve SV, Chen YK, Debes E (2005) International symposium on workload characterization, pp 34–45Bienia C, Kumar S, Singh JP, Li K (2008) 17th international conference on parallel architectures and compilation techniques (PACT), pp 72–8

    Temporal-Aware Mechanism to Detect Private Data in Chip Multiprocessors

    Full text link
    © 2013 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.Most of the data referenced by sequential and parallel applications running in current chip multiprocessors are referenced by only one thread and can be considered as private data. A lot of recent proposals leverage this observation to improve many aspects of chip multiprocessors, such as reducing coherence overhead or the access latency to distributed caches. The effectiveness of those proposals depend to a large extent on the amount of detected private data. However, the mechanisms proposed so far do not consider thread migration and the private use of data within different application phases. As a result, a considerable amount of data is not detected as private. In order to make this detection more accurate and reaching more significant improvements, we propose a mechanism that is able to account for both thread migration and private data within application phases. Simulation results for 16-core systems show that, thanks to our mechanism, the average number of pages detected as private significantly increases from 43% in previous proposals up to 74% in ours. Finally, when our detection mechanism is used to deactivate the coherence for private data in a directory protocol, our proposal improves execution time by 13% with respect to previous proposals.This work was supported by the Spanish MINECO, as well as European Commission FEDER funds, under grant TIN2012-38341-C04-01/03 and by the VIRTICAL project (grant agreement no 288574) which is funded by the European Commission within the Research Programme FP7.Ros Bardisa, A.; Cuesta Sáez, BA.; Gómez Requena, ME.; Robles Martínez, A.; Duato Marín, JF. (2013). Temporal-Aware Mechanism to Detect Private Data in Chip Multiprocessors. En Proceedings of the International Conference on Parallel Processing. IEEE. 562-571. https://doi.org/10.1109/ICPP.2013.70S56257

    PS directory: a scalable multilevel directory cache for CMPs

    Full text link
    The final publication is available at Springer via http://dx.doi.org/10.1007/s11227-014-1332-5As the number of cores increases in current and future chip-multiprocessor (CMP) generations, coherence protocols must rely on novel hardware structures to scale in terms of performance, power, and area. Systems that use directory information for coherence purposes are currently the most scalable alternative. This paper studies the important differences between the directory behavior of private and shared blocks, which claim for a separate management of both types of blocks at the directory. We propose the PS directory, a two-level directory cache that keeps the reduced number of frequently accessed shared entries in a small and fast first-level cache, namely Shared cache, and uses a larger and slower second-level Private cache to track the large amount of private blocks. Entries in the Private cache do not implement the sharer vector, which allows important silicon area savings. Speed and area reasons suggest the use of eDRAM technology, much denser but slower than SRAM technology, for the Private cache, which in turn brings energy savings. Experimental results for a 16-core CMP show that, compared to a conventional directory, the PS directory improves performance by 14 % while reducing silicon area and energy consumption by 34 and 27 %, respectively. Also, compared to the state-of-the-art Multi-Grain Directory, the PS directory apart from increasing performance, it reduces power by 18.7 %, and provides more scalability in terms of area.This work has been jointly supported by the MINECO and European Commission (FEDER funds) under the project TIN2012-38341-C04-01 and the Fundacion Seneca-Agencia de Ciencia y Tecnologia de la Region de Murcia under the project Jovenes Lideres en Investigacion 18956/JLI/13.Valls, JJ.; Ros Bardisa, A.; Sahuquillo Borrás, J.; Gómez Requena, ME. (2015). PS directory: a scalable multilevel directory cache for CMPs. Journal of Supercomputing. 71(8):2847-2876. https://doi.org/10.1007/s11227-014-1332-5S28472876718Acacio ME, González J, García JM, Duato J (2001) A new scalable directory architecture for large-scale multiprocessors. In: Proceedings of 7th international symposium on high-performance computer architecture (HPCA), pp 97–106Acacio ME, González J, García JM, Duato J (2005) A two-level directory architecture for highly scalable cc-NUMA multiprocessors. IEEE Trans Parallel Distrib Syst (TPDS) 16(1):67–79Agarwal N, Krishna T, Peh L-S, Jha NK (2009) GARNET: a detailed on-chip network model inside a full-system simulator. In: Proceedings of IEEE international symposium on performance analysis of systems and software (ISPASS), pp 33–42Barroso LA, Gharachorloo K, McNamara R et al (2000) Piranha: a scalable architecture based on single-chip multiprocessing. In: Proceedings of 27th international symposium on computer architecture (ISCA), pp 12–14Bienia C, Kumar S, Singh JP, Li K (2008) The PARSEC benchmark suite: characterization and architectural implications. In: Proceedings of 17th international conference on parallel architectures and compilation techniques (PACT), pp 72–81Chaiken D, Kubiatowicz J, Agarwal A (1991) LimitLESS directories: a scalable cache coherence scheme. In: 4th international conference on architectural support for programming language and operating systems (ASPLOS), pp 224–234Chen G (1993) Slid: a cost-effective and scalable limited-directory scheme for cache coherence. In: 5th international conference on parallel architectures and languages Europe (PARLE), pp 341–352Conway P, Kalyanasundharam N, Donley G, Lepak K, Hughes B (2010) Cache hierarchy and memory subsystem of the AMD opteron processor. IEEE Micro 30(2):16–29Cuesta B, Ros A, Gómez ME, Robles A, Duato J (2011) Increasing the effectiveness of directory caches by deactivating coherence for private memory blocks. In: Proceedings of 38th international symposium on computer architecture (ISCA), pp 93–103Ferdman M, Lotfi-Kamran P, Balet K, Falsafi B (2011) Cuckoo directory: a scalable directory for many-core systems. In: 17th international symposium on high-performance computer architecture (HPCA), pp 169–180Guo S-L, Wang H-X, Xue Y-B, Li C-M, Wang D-S (2010) Hierarchical cache directory for cmp. J Comput Sci Technol 25(2):246–256Gupta A, Weber W-D, Mowry TC (1990) Reducing memory traffic requirements for scalable directory-based cache coherence schemes. In: Proceedings of international conference on parallel processing (ICPP), pp 312–321Kalla R, Sinharoy B, Starke WJ, Floyd M (2010) POWER7: IBMs next-generation server processor. IEEE Micro 30(2):7–15Kim C, Burger D, Keckler SW (2002) An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches. In: Proceedings of 10th international conference on architectural support for programming language and operating systems (ASPLOS), pp 211–222Luk C-K, Cohn R, Muth R, Patil H, Klauser A, Lowney G, Wallace S, Reddi VJ, Hazelwood K (2005) Pin: building customized program analysis tools with dynamic instrumentation. In: Proceedings of ACM SIGPLAN conference on programming language design and implementation (PLDI), June 2005, pp 190–200Magnusson PS, Christensson M, Eskilson J et al (2002) Simics: a full system simulation platform. IEEE Comput 35(2):50–58Martin MM, Sorin DJ, Beckmann BM et al (2005) Multifacet’s general execution-driven multiprocessor simulator (GEMS) toolset. Comput Archit News 33(4):92–99Marty MR, Hill MD (2007) Virtual hierarchies to support server consolidation. In: Proceedings of 34th international symposium on computer architecture (ISCA), pp 46–56Marty MR, Hill MD (2008) Virtual hierarchies. IEEE Micro 28(1):99–109Matick RE, Schuster SE (2005) Logic-based eDRAM: origins and rationale for use. IBM J Res Dev 49(1):145–165Muralimanohar N, Balasubramonian R, Jouppi NP (2009) Cacti 6.0, HP Labs, technical report HPL-2009-85O’Krafka BW, Newton AR (1990) An empirical evaluation of two memory-efficient directory methods. In: Proceedings of 17th international symposium on computer architecture (ISCA), pp 138–147Ros A, Acacio ME, García JM (2010) A scalable organization for distributed directories. J Syst Archit (JSA) 56(2–3):77–87Ros A, Cuesta B, Fernández-Pascual R, Gómez ME, Acacio ME, Robles A, García JM, Duato J (2012) Extending magny-cours cache coherence. IEEE Trans Comput (TC) 61(5):593–606Sanchez D, Kozyrakis C (2012) SCD: a scalable coherence directory with flexible sharer set encoding. In: Proceedings of 18th international sympoium on high-performance computer architecture (HPCA), pp 129–140Shah M, Barreh J, Brooks J et al (2007) UltraSPARC T2: a highly-threaded, power-efficient, SPARC SoC. In: Proceedings of IEEE Asian solid-state circuits conference, pp 22–25Sinharoy B, Kalla RN, Tendler JM, Eickemeyer RJ, Joyner JB (2005) Power5 system microarchitecture. IBM J Res Dev 49(4/5):505–521Tendler JM, Dodson JS, Fields JS, Le H, Sinharoy B (2002) POWER4 system microarchitecture. IBM J Res Dev 46(1):5–25Valero A, Sahuquillo J, Petit S, Lorente V, Canal R, López P, Duato J (2009) An hybrid eDRAM/SRAM macrocell to implement first-level data caches. In: Proceedings of 42nd IEEE/ACM international symposium on microarchitecture (MICRO), pp 213–221Valls JJ, Ros A, Sahuquillo J, Gómez ME, Duato J (2012) PS-Dir: a scalable two-level directory cache. In: Proceedings of 21st international conference on parallel architectures and compilation techniques (PACT), pp 451–452Woo SC, Ohara M, Torrie E, Singh JP, Gupta A (1995) The SPLASH-2 programs: characterization and methodological considerations. In: Proceedings of 22nd international symposium on computer architecture (ISCA), pp 24–36Wu X, Li J, Zhang L, Speight E, Rajamony R, Xie Y (2009) Hybrid cache architecture with disparate memory technologies. In: Proceedings of 36th international symposium on computer architecture (ISCA), pp 34–45Zebchuk J, Falsafi B, Moshovos A (2013) Multi-grain coherence directories. In: Proceedings of 46th IEEE/ACM international symposium on microarchitecture (MICRO), pp 359–370Zebchuk J, Srinivasan V, Qureshi MK, Moshovos A (2009) A tagless coherence directory. In: Proceedings of 42nd IEEE/ACM international symposium on microarchitecture (MICRO), pp 423–434Zhao H, Shriraman A, Dwarkadas S, Srinivasan V (2011) SPATL: Honey, I shrunk the coherence directory. In: Proceedings of 20th international conference on parallel architectures and compilation techniques (PACT), pp 148–15

    Extending magny-cours cache coherence

    Full text link
    One cost-effective way to meet the increasing demand for larger high-performance shared-memory servers is to build clusters with off-the-shelf processors connected with low-latency point-to-point interconnections like HyperTransport. Unfortunately, HyperTransport addressing limitations prevent building systems with more than eight nodes. While the recent High-Node Count HyperTransport specification overcomes this limitation, recently launched twelve-core Magny-Cours processors have already inherited it and provide only 3 bits to encode the pointers used by the directory cache which they include to increase the scalability of their coherence protocol. In this work, we propose and develop an external device to extend the coherence domain of Magny-Cours processors beyond the 8-node limit while maintaining the advantages provided by the directory cache. Evaluation results for systems with up to 32 nodes show that the performance offered by our solution scales with the number of nodes, enhancing the directory cache effectiveness by filtering additional messages. Particularly, we reduce execution time by 47 percent in a 32-die system with respect to the 8-die Magny-Cours configuration.This work was supported by the Spanish MICINN, Consolider Programme and Plan E funds, as well as European Commission FEDER funds, under Grants CSD2006-00046 and TIN2009-14475-C04-01/03. It was also partly supported by (PROMETEO from Generalitat Valenciana (GVA) under Grant PROMETEO/2008/060).Ros Bardisa, A.; Cuesta Sáez, BA.; Fernández-Pascual, R.; Gómez Requena, ME.; Acacio Sánchez, ME.; Robles Martínez, A.; García Carrasco, JM.... (2012). Extending magny-cours cache coherence. IEEE Transactions on Computers. 61(5):593-606. https://doi.org/10.1109/TC.2011.65S59360661
    corecore